ABSTRACT

Social networks play a fundamental role in the diffusion of
information. However, information can reach a person in two different
ways: through connections in our social networks, and through the
influence of external out-of-network sources, like the mainstream
media. While most present models of information adoption in networks
assume that information only passes from node to node via the edges of
the underlying network, the recent availability of massive online
social media data allows us to study this process in more detail.

We present a model in which information can reach a node via 
the links of the social network or through the influence of external 
sources. We then develop an efficient model parameter fitting tech- 
nique and apply the model to the emergence of URL mentions in 
the Twitter network. Using a complete one month trace of Twitter 
we study how information reaches the nodes of the network. We 
quantify the external influences over time and describe how these 
influences affect the information adoption. We discover that the in- 
formation tends to "jump" across the network, which can only be 
explained as an effect of an unobservable external influence on the 
network. We find that only about 71% of the information volume
in Twitter can be attributed to network diffusion, and the remaining 
29% is due to external events and factors outside the network. 
Categories and Subject Descriptors: H.2.8 [Database Manage- 
ment]: Database Applications - Data mining 
General Terms: Algorithms, theory, experimentation. 
Keywords: Diffusion of innovations, Information cascades, Information
diffusion, External influence, Twitter, Social networks.

1. INTRODUCTION 

Networks represent a fundamental medium for the emergence and
diffusion of information [23]. For example, we often think of
information, a rumor, or a piece of content as being passed over the
edges of the underlying social network [22, 29]. This way information
spreads over the edges of the network like an epidemic [15]. However,
due to the emergence of mass media, like newspapers, TV stations and
online news sites, information not only reaches us through the links
of our social networks but also through the influence of exogenous
out-of-network sources [?]. From the early stages of research on news
media and, more generally, information diffusion, there has been a
tension between global effects from the mass media and local effects
carried by social structure [20].

Figure 1: Our model of external influence. A node (denoted by a big
circle) is exposed to information through an external source (governed
by the external activity $\lambda_{ext}(t)$) and by already infected
neighbors (governed by the internal hazard function
$\lambda_{int}(t)$). With each new exposure $x$, the probability of
infection changes according to the exposure curve $\eta(x)$. We infer
both the external activity $\lambda_{ext}(t)$ and the exposure curve
$\eta(x)$.

Traditionally, it was hard to capture and study the effects of mass 
media and social networks simultaneously [16]. However, the Web,
blogs and social media changed the traditional picture of the di- 
chotomy between the local effects carried by the links of social 
networks and the global influence from the mass media. Today, 
mass media as well as the social networks both exist in the same 
Web "ecosystem," which means that it is possible to collect mas- 
sive online social media data and at the same time capture the ef- 
fects of mass media as well as the influence arising from the social 
networks [21]. This allows us to study processes of information
diffusion and emergence in much finer detail than ever before. 

In this paper, we ask the question "How does information trans- 
mitted by the mass media interact with the personal influence aris- 
ing from social networks?" Based on the complete one month Twit- 
ter data we study ways in which information reaches the nodes of 
the Twitter network. We analyze over 3 billion tweets to discover 
mechanisms by which information appears and spreads through the 
Twitter network. In particular, we contrast two main ways by which 
information emerges at the nodes of the network: diffusion over
the edges of the network, and external influence, when informa- 
tion "jumps" across the network and appears at a seemingly ran- 
dom node. For example, when information appears at a node that 
has no connections to nodes that have previously mentioned the
information, the emergence of information at that node can only be
explained by the influence of some unobserved exogenous source.
However, when information appears at a node with a neighbor that 
already tweeted it, then it is not clear whether the node tweeted the 
information due to neighbor's influence or due to the influence of 
the exogenous source. Thus, the effects of internal and external
infections get confounded [4], and the goal of the paper is to develop
models that will allow us to separate the influence transmitted by 
social networks from the influence of the exogenous source(s). 

Effects of external influence. On Twitter, users often post links to
various webpages, most often links to news articles, blog posts, funny
videos or pictures. Generally, there are two fundamental ways in which
users learn about these URLs and tweet them. One is through exogenous
out-of-network effects. For example, one can imagine a scenario where
a user checks the news on CNN.com, finds an interesting article and
then posts a tweet with a URL to the article. In this case CNN is the
"external influence" that caused that URL to emerge at a particular
Twitter user. In contrast, users can also come across URLs by seeing
them posted by other users that they follow. This type of user-to-user
exposure is what we refer to as "internal influence," or diffusion. We
find that both external and internal influence play a significant role
in the emergence of URLs in the Twitter network.

Modeling the external influence. In order to accurately model the 
emergence of content in Twitter we need to consider the activity of 
the invisible out-of-network sources that also transmit information 
to the nodes of the Twitter network (via channels, like TV, newspa- 
pers, etc.). We present a probabilistic generative model of informa- 
tion emergence in networks, in which information can reach a node 
via the links of the social network or through the influence of the 
external source. Developing such a model is important. For example,
we simulated a purely non-diffusive process that picks nodes of the
Twitter network at random and 'infects' them. After such a process
infects 10% of the nodes, about 30% of infections (falsely) appear to
be a result of diffusion, i.e., the random process picks a node that
has (simply by chance) an already infected neighbor. Thus, instead of
estimating the amount of internal influence at 0%, a naive estimate
would be 30%. Because diffusion and external influence are confounded
in this way, we aim to separate the two factors.
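
The confounding effect described above can be reproduced on a small
scale. The following Python sketch infects nodes uniformly at random
(no diffusion at all) and counts how many infections land on a node
that already has an infected neighbor. The random graph, its size and
average degree, and all function names are assumptions made for
illustration; they are not the Twitter network or the original
experiment, so the resulting fraction will differ from the 30%
reported above.

```python
import random

def random_graph(n, avg_degree, seed=0):
    """Hypothetical stand-in network: an undirected random graph on n nodes."""
    rng = random.Random(seed)
    neighbors = {v: set() for v in range(n)}
    for _ in range(n * avg_degree // 2):
        a, b = rng.sample(range(n), 2)
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors

def apparent_diffusion_fraction(neighbors, infect_fraction, seed=0):
    """Infect nodes purely at random and return the fraction of infections
    that hit a node with at least one already-infected neighbor."""
    rng = random.Random(seed)
    nodes = list(neighbors)
    order = rng.sample(nodes, int(infect_fraction * len(nodes)))
    infected = set()
    looks_like_diffusion = 0
    for node in order:
        if any(nb in infected for nb in neighbors[node]):
            looks_like_diffusion += 1   # would be mistaken for diffusion
        infected.add(node)
    return looks_like_diffusion / len(order)

graph = random_graph(n=2000, avg_degree=8, seed=1)
frac = apparent_diffusion_fraction(graph, infect_fraction=0.10, seed=2)
print(f"{frac:.2f} of purely random infections look like diffusion")
```

Even though the true amount of diffusion in this process is zero, a
naive neighbor-based count attributes a substantial share of the
infections to the network.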

In our model (Figure 1) we distinguish between exposures and
infections [24]. An exposure event occurs when a node gets exposed to
information I, and an infection event occurs when a node posts a tweet
with information I. Exposures to information can lead to an infection.
A node can get exposed to information in two different ways. First, a
node U gets exposed to or becomes aware of information I whenever one
of its neighbors in the social network posts a tweet containing I (we
call this an internal exposure). The second way U can be exposed to I
is through the activity of the external source (we refer to this as an
external exposure). We refer to the volume of external exposures over
time as the event profile. In order to establish the connection
between exposures and infections, we define the notion of the exposure
curve, which maps the number of times node U has been exposed to I
into the probability of U getting infected [24]. Distinguishing
between exposures and infections, and explicitly modeling the exposure
curve, allows us to capture rich effects. For example, during the
diffusion of a news story, the story may become stale and less
relevant each time a user sees it, so the probability of infection
would decrease with each exposure. On the other hand, exposures to a
story about new technology may have the opposite effect; with each
exposure the user learns more about the technology so the probability
of infection would increase. Exposure curves allow us to model such
diverse behaviors, which our model is able to accurately estimate from
the data.



Furthermore, we also develop an efficient parameter estimation 
technique. We are given a network and a set of node infection 
times. We then infer the event profile, which quantifies the num- 
ber of exposures generated by the external source over time. We 
also infer the exposure curve that models the probability of infec- 
tion as a function of the number of exposures of a node. Our model 
accurately distinguishes external influence from network diffusion. 

We experiment with our model on Twitter and find that we can
accurately detect the occurrence of external out-of-network events,
and that the exposure curve inferred from our model is often 50% more
accurate than baseline methods. We find that even though we are
studying processes intrinsic to the Twitter network, only about 71% of
the content that appears in Twitter can be attributed to diffusion
through the edges of the Twitter network. We fit our model to 18,186
different URLs that have appeared across Twitter users, and we use the
inferred parameters of the model to provide insights into the
mechanics of the emergence of these URLs. Moreover, we also perform a
per-topic analysis and find that topics like Politics and Sports are
most heavily driven by external sources, while Entertainment and
Technology are driven internally, with only about 18% of exposures
being external.

2. RELATED WORK 

Work on the diffusion of innovations [23] provides a conceptual
framework to study the emergence of information in networks.
Conceptually, we think of an (often implicit) network where each node
is either active (infected, influenced) or inactive, and active nodes
can then spread the contagion (information, disease) along the edges
of the underlying network. A rich set of models has been developed
that all try to describe different mechanisms by which the contagion
spreads from an infected to an uninfected node [8, 10, 22, 24, 29].
However, nearly all models focus only on the diffusive part of the
contagion adoption process, while neglecting the external influence.
In this regard our work introduces an important dimension to the
diffusion of innovations framework, where we explicitly model the
activity and influence of the external source.

External influence in networks has been considered in the case of the
popularity of YouTube videos [11]. The authors considered a simple
model of information diffusion on an implicit completely connected
network and argued that since some videos became popular quicker than
their model predicted, the additional popularity must have been a
result of external influence. Our approach differs significantly: we
directly consider the network and the effect of node-to-node
interactions, explicitly infer the activity of the external source
over time, and use a much more realistic model of information adoption
that distinguishes between exposures to and the adoption of
information. Our model builds on the notion of exposure curves, which
was proposed and studied by Romero et al. [24]. Recently, it was also
argued [26] that it is the shape of exposure curves that stops the
information from spreading. We make a step forward by providing an
inference method that infers the shape of such exposure curves.
Simulations show that our method infers the exposure curves much more
accurately than the methods previously proposed [24, 26].

3. PROPOSED MODEL 

Here, we develop in detail our novel information diffusion model 
that incorporates both the spread of information from node to node 
along edges in the network as well as the external influences act- 
ing on the network. Additionally, our model reconciles the gap 
between a stream of exposures arriving in continuous time and a 
series of discrete decisions leading to infection. 



We refer to the amount of influence external sources have on the 
network as a function of time as the event profile. It is proportional 
to the probability of any node receiving an external exposure at a 
particular time. We use the term contagion to refer to a particular 
piece of information emerging in the Twitter network and we say 
a node is infected with a particular contagion when she first men- 
tions/tweets the contagion. We model contagions as independent 
of each other, which means we consider them one by one. 

We illustrate our model in a node-centric context in Fig. 1. Assume a
single contagion (i.e., a piece of information). As time progresses, a
node receives a stream of external exposures of varying intensity,
governed by the event profile $\lambda_{ext}(t)$. Additionally, its
neighbors in the network also become infected by the contagion, and
each infected neighbor generates an internal exposure. Each exposure
has a chance of infecting the node, but with the arrival of each
exposure, the probability of infection changes according to the
exposure curve $\eta(x)$. Eventually, either the arrival of exposures
will cease, or the node will become infected and then expose its
neighbors. Our goal is to infer the number of exposures generated by
the external source over time, as well as the shape of the exposure
curve $\eta(x)$ that governs the probability of a node's infection.

Modeling the internal exposures. Consider a single contagion. In our
model, an internal exposure occurs when a neighbor of a node becomes
infected, and then an exposure is transmitted after a random interval
of time. Imagine a real world scenario in which the social network is
the Twitter network and the contagion spreading across the network is
a particular URL. If a neighbor writes a tweet involving a particular
URL and a user then sees their neighbor's tweet, then and only then
has the internal exposure propagated along the edge. An infected node
will expose each of its outgoing neighbors exactly once, and the time
it takes for each exposure to occur is sampled from some distribution
universal to all edges in the network. Therefore, a hazard function
[12] is appropriate to model this process. Hazard functions were
originally developed in actuarial science, and they describe a
distribution of the length of time it takes for an event to occur.
Recently, [13] used hazard functions as a basis for disease
propagation in continuous time across social networks. They are
extremely effective at modeling discrete events that happen over
continuous time. In this respect hazard functions represent a
principled way of modeling the occurrence of discrete events (i.e.,
exposures) as a function of continuous time.

Specifically, let $\lambda_{int}$ be the internal hazard function,
where

$$\lambda_{int}(t)\, dt = P(i \text{ exposes } j \in [t, t + dt) \mid i \text{ hasn't exposed } j \text{ yet})$$

for any neighboring nodes $i$ and $j$, where $t$ is the amount of time
that has passed since node $i$ was infected. In our context,
$\lambda_{int}$ effectively models how long it takes a node to notice
one of its neighbors becoming infected. It is a function of the
frequency with which nodes check up on each other. For the Twitter
network, each time a user logs in they are updated on all of their
neighbors.

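The expected number of internal exposures a node $i$ has received by
time $t$, which we will define as $\Lambda_{int}^{(i)}(t)$, is the sum
of the cumulative distribution functions of exposures propagating
along each of the node's inbound edges and can be derived as follows:

$$\Lambda_{int}^{(i)}(t) = \sum_{j:\, j \text{ is } i\text{'s inf. neighbor}} P(j \text{ exposed } i \text{ before } t) \qquad (1)$$

$$= \sum_{j:\, j \text{ is } i\text{'s inf. neighbor}} \left[ 1 - \exp\left( -\int_{\tau_j}^{t} \lambda_{int}(s - \tau_j)\, ds \right) \right] \qquad (2)$$

where $\tau_j$ is the infection time of node $j$.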
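
As a small numerical illustration of Eq. (2), the sketch below
evaluates the expected number of internal exposures for one node. It
assumes the hazard function $\lambda_{int}(t) = t$, for which the
cumulative hazard is $t^2/2$ and the per-edge exposure time follows a
Rayleigh distribution; the function names and the example infection
times are hypothetical.

```python
import math

def rayleigh_exposure_cdf(elapsed):
    """P(exposure has arrived within `elapsed` time units of the neighbor's
    infection), assuming lambda_int(t) = t, i.e. CDF = 1 - exp(-elapsed**2 / 2)."""
    if elapsed <= 0:
        return 0.0
    return 1.0 - math.exp(-0.5 * elapsed ** 2)

def expected_internal_exposures(t, neighbor_infection_times,
                                cdf=rayleigh_exposure_cdf):
    """Eq. (2): sum of per-edge exposure CDFs over the infected inbound
    neighbors, where each entry is a neighbor's infection time tau_j."""
    return sum(cdf(t - tau_j) for tau_j in neighbor_infection_times
               if tau_j <= t)

taus = [0.0, 1.0, 2.5]   # hypothetical infection times of three neighbors
for t in (0.5, 2.0, 10.0):
    print(t, round(expected_internal_exposures(t, taus), 4))
```

The value grows monotonically with $t$ and saturates at the number of
infected neighbors, since each neighbor delivers exactly one exposure.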



Modeling the external exposures. The second source of exposures to a
particular single contagion for nodes in the network comes from the
external source acting on the network. The fundamental property of the
problem we are trying to solve is that the external source cannot be
observed. The source varies in intensity over time, and this function
is called the event profile, which we designate as $\lambda_{ext}(t)$.
Specifically,

$$\lambda_{ext}(t)\, dt = P(i \text{ receives exposure} \in [t, t + dt))$$

for any node $i$, where $t$ represents the amount of time since the
contagion first appeared in the network. A couple of things should be
noted here. First, all nodes have the same probability of receiving an
external exposure at any point in time. Second, $\lambda_{ext}$ is not
conditioned upon the node not already having received an external
exposure. This means that any node can receive an arbitrary number of
external exposures. We call $\lambda_{ext}$ the event profile because
it describes an actual real world event that caused the information to
arrive in the network and start spreading. As the event progresses
over time, the event's efficacy in the network changes. For example,
if our contagion is civil unrest in Libya, then every time their ruler
Gaddafi gives a speech or the rebels win a battle we would expect a
spike in the intensity of the external source and thus in the event
profile $\lambda_{ext}$. As time passes without any new developments
or as the event's relevancy fades, we expect $\lambda_{ext}$ to
decrease to 0. However, every time there is a new development we
expect a spike in the external event profile $\lambda_{ext}$. We will
infer $\lambda_{ext}$ non-parametrically, so we can quantify the
relevancy of any event over its lifespan.

In order to derive the distribution of exposures a node receives as a
function of time, we model the arrival of exposures as a binomial
distribution. Suppose we take the entire continuous time interval of
the lifetime of the contagion and break it down into smaller but
finite time intervals. Then whether an exposure occurred during each
such subinterval is a Bernoulli random variable (exposure vs. no
exposure) with its own probability. Therefore, the total number of
exposures received in a time interval is a sum of Bernoulli random
variables, just as a binomial random variable is a sum of Bernoulli
random variables. Let's say that $\lambda_{ext}$ is constant for all
time and that time is discretized into finite intervals of length
$\Delta t$. Then the probability that $n$ external exposures have been
received after $T$ time intervals is exactly a binomial distribution:

$$P_{exp}(n; T \cdot \Delta t) = \binom{T}{n} (\lambda_{ext} \cdot \Delta t)^n \cdot (1 - \lambda_{ext} \cdot \Delta t)^{T - n}$$

Set $t = T \cdot \Delta t$. If we take the limit as $\Delta t \to 0$
and $T \to \infty$ such that $t$ does not change, then this
probability approaches

$$P_{exp}(n; t) \approx \binom{t/dt}{n} (\lambda_{ext} \cdot dt)^n \cdot (1 - \lambda_{ext} \cdot dt)^{t/dt - n}$$

To relax the constraint that $\lambda_{ext}$ is constant, we use the
average of $\lambda_{ext}(t)$ over $t$:

$$P_{exp}(n; t) \approx \binom{t/dt}{n} \left( \frac{\Lambda_{ext}(t)}{t} \cdot dt \right)^n \cdot \left( 1 - \frac{\Lambda_{ext}(t)}{t} \cdot dt \right)^{t/dt - n}$$

where $\Lambda_{ext}(t) = \int_0^t \lambda_{ext}(s)\, ds$. Finally,
users are receiving both external and internal exposures at the same
time, so we need to take into account both processes. This would imply
taking the convolution of the two probabilities, which would be
computationally infeasible. Instead, we use the average of
$\lambda_{ext}(t) + \lambda_{int}^{(i)}(t)$:



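| Symbol | Name | Description | Technical Definition |
|---|---|---|---|
| $\lambda_{ext}(t)$ | The Event Profile | Proportional to the probability of any node receiving an exposure at time $t$. | $\lambda_{ext}(t)\, dt = P(\text{node exposed} \in [t, t + dt))$ |
| $\lambda_{int}(t)$ | Internal Hazard Function | Governs the random amount of time it takes an infected node to expose its neighbors. | $\lambda_{int}(t)\, dt = P(i \text{ exposes } j \in [t, t + dt) \mid i \text{ hasn't exposed } j \text{ yet})$ |
| $\eta(x)$ | The Exposure Curve (parameters $\rho_1$, $\rho_2$) | Determines how the probability of infection changes with each exposure. | $\eta(x) = P(\text{infected right after } x^{th} \text{ exposure})$ |
| $P_{exp}^{(i)}(n; t)$ | The Exposure Distribution | The probability that node $i$ has received $n$ exposures by time $t$. | $P_{exp}^{(i)}(n; t) = \binom{t/\Delta t}{n} \left( \frac{\Lambda_{int}^{(i)}(t) + \Lambda_{ext}(t)}{t} \cdot \Delta t \right)^n \left( 1 - \frac{\Lambda_{int}^{(i)}(t) + \Lambda_{ext}(t)}{t} \cdot \Delta t \right)^{t/\Delta t - n}$ |
| $\tau_i$ | Infection time | The infection time of node $i$. | |

Table 1: Definition of symbols used in the model.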



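$$P_{exp}^{(i)}(n; t) \approx \binom{t/dt}{n} \left( \frac{\Lambda_{int}^{(i)}(t) + \Lambda_{ext}(t)}{t} \cdot dt \right)^n \qquad (3)$$

$$\times \left( 1 - \frac{\Lambda_{int}^{(i)}(t) + \Lambda_{ext}(t)}{t} \cdot dt \right)^{t/dt - n} \qquad (4)$$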



Effectively, we approximated the flux of exposures as constant in 
time such that each interval of time has an equal probability of an 
exposure arriving, so the sum of the events is a standard binomial 
random variable. 
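
The binomial exposure distribution of Eqs. (3)-(4) can be evaluated
directly for a small but finite interval length. The sketch below is
one possible way to do so in Python; the function name, the default
interval length `dt`, and the use of log-gamma terms for numerical
stability are choices made here for illustration, and `lam_total`
stands for the combined expected exposure count
$\Lambda_{int}^{(i)}(t) + \Lambda_{ext}(t)$.

```python
import math

def exposure_pmf(n, t, lam_total, dt=1e-3):
    """Binomial probability that a node has received n exposures by time t.

    lam_total is Lambda_int(t) + Lambda_ext(t); the time axis is cut into
    t/dt intervals, each carrying probability (lam_total / t) * dt.
    """
    trials = round(t / dt)
    p = lam_total / t * dt
    if p >= 1.0:
        raise ValueError("dt is too large for this exposure rate")
    if n < 0 or n > trials:
        return 0.0
    if p <= 0.0:
        return 1.0 if n == 0 else 0.0
    # log of C(trials, n) * p**n * (1 - p)**(trials - n)
    log_pmf = (math.lgamma(trials + 1) - math.lgamma(n + 1)
               - math.lgamma(trials - n + 1)
               + n * math.log(p) + (trials - n) * math.log1p(-p))
    return math.exp(log_pmf)

pmf = [exposure_pmf(n, t=5.0, lam_total=3.0) for n in range(60)]
mean = sum(n * q for n, q in enumerate(pmf))
print(round(sum(pmf), 6), round(mean, 6))
```

As expected for a binomial random variable, the probabilities sum to
one and the mean number of exposures equals `lam_total`.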

Modeling the exposure curve. We model the exposure curve as a
parameterized equation. Recall that the exposure curve describes the
probability of infection as a function of the number of exposures
received. More specifically, if $x$ is the current number of exposures
the node has received and $\eta(x)$ is the exposure curve, then

$$\eta(x) = P(\text{node } i \text{ is infected immediately after } x^{th} \text{ exposure}).$$

We choose to parameterize $\eta(x)$ as

$$\eta(x) = \frac{\rho_1}{\rho_2} \cdot x \cdot \exp\left( 1 - \frac{x}{\rho_2} \right)$$

where $\rho_1 \in (0, 1]$ and $\rho_2 > 0$. Parameterizing $\eta(x)$
in this manner allows for several desirable properties. First,
$\eta(0) = 0$, so it is impossible to become infected by a contagion
before being exposed to it. Secondly, this function is unimodal with
an exponential tail, so there is a critical mass of exposures when the
contagion is most infectious followed by decay brought on by the idea
becoming overexposed/tiresome. Lastly, and most importantly, $\rho_1$
and $\rho_2$ have important conceptual meanings:
$\rho_1 = \max_x \eta(x)$ and $\rho_2 = \arg\max_x \eta(x)$. Because
of this, we can think of $\rho_1$ as a general measure of how
infectious a contagion is in the network and $\rho_2$ as a measure of
the contagion's enduring relevancy. Fig. 2 shows several different
forms of $\eta(x)$. This parameterization is expressive, but any other
parameterization for $\eta(x)$ is also valid. For the remainder of the
paper, we will discuss the model in the context of the $\eta(x)$
parameterization presented above.
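
The stated properties of this parameterization are easy to check
numerically. The short Python sketch below implements the exposure
curve and locates its peak on the integer exposure counts; the
parameter values are arbitrary examples chosen for illustration.

```python
import math

def exposure_curve(x, rho1, rho2):
    """eta(x) = (rho1 / rho2) * x * exp(1 - x / rho2)."""
    return (rho1 / rho2) * x * math.exp(1.0 - x / rho2)

rho1, rho2 = 0.02, 5          # example values, not fitted to any data
values = [exposure_curve(x, rho1, rho2) for x in range(0, 31)]
peak_x = max(range(len(values)), key=values.__getitem__)
print(peak_x, values[peak_x])
```

The curve is zero at $x = 0$, peaks at $x = \rho_2$ with height
$\rho_1$, and then decays with an exponential tail.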

Figure 2: Example exposure curves $\eta(x)$, where $\eta(x)$ is the
probability of a node becoming infected upon its $x^{th}$ exposure to
the contagion. The parameters of $\eta(x)$ are $\rho_1$ and $\rho_2$.

From exposures to infections. In order to fit the parameters of the
model to observed data, we must now construct the probability
functions to describe the model. With the equations given above,
building the distribution of the infection time of a node $i$ can be
done as follows. Let $F^{(i)}(t) = P(\tau_i < t)$ be the probability
that node $i$ has been infected by time $t$, where $\tau_i$ is the
infection time of node $i$. Making use of the quantity
$P_{exp}^{(i)}(n; t)$,

$$F^{(i)}(t) = \sum_{n=1}^{\infty} P[i \text{ has } n \text{ exp.}] \times P[i \text{ inf.} \mid i \text{ has } n \text{ exp.}] \qquad (5)$$

$$= \sum_{n=1}^{\infty} P_{exp}^{(i)}(n; t) \times \left[ 1 - \prod_{k=1}^{n} \left[ 1 - \eta(k) \right] \right] \qquad (6)$$

While $F^{(i)}(t)$ is analogous to the cumulative distribution
function of the infection probability, it is important to note that it
is not actually a distribution; $\lim_{t \to \infty} F^{(i)}(t) < 1$
as a result of $\lim_{x \to \infty} \eta(x) = 0$. This is ideal
because it implies that there is a non-zero chance that a node will
never become infected, as should be the case.
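
To see that the infection probability of Eq. (6) indeed saturates
below 1, one can evaluate it for increasing expected exposure counts.
The Python sketch below does so under one simplifying assumption that
is not part of the text above: the binomial exposure distribution is
replaced by its small-interval limit, a Poisson distribution with mean
`lam_total`. Function names and parameter values are hypothetical.

```python
import math

def exposure_curve(x, rho1, rho2):
    return (rho1 / rho2) * x * math.exp(1.0 - x / rho2)

def infection_cdf(lam_total, rho1, rho2, n_max=500):
    """sum_n P_exp(n) * [1 - prod_{k<=n} (1 - eta(k))], as in Eq. (6).

    Assumption of this sketch: P_exp(n) is taken to be Poisson(lam_total),
    the dt -> 0 limit of the binomial exposure distribution.
    """
    total = 0.0
    survive = 1.0                   # running product of (1 - eta(k))
    pmf = math.exp(-lam_total)      # Poisson probability of n = 0
    for n in range(1, n_max + 1):
        pmf *= lam_total / n        # Poisson probability of n exposures
        survive *= 1.0 - exposure_curve(n, rho1, rho2)
        total += pmf * (1.0 - survive)
    return total

for lam in (1, 5, 20, 200):
    print(lam, round(infection_cdf(lam, rho1=0.02, rho2=5), 4))
```

With these example parameters the probability of infection keeps
increasing with the expected number of exposures but levels off well
below 1, so a node may never become infected no matter how often it is
exposed.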

3.1 Inferring the model parameters 

Next we develop a method of inferring the model parameters for a given
network and the trace of a single contagion. We fit the model to each
contagion separately. We are given the network and the infection times
for each node that got infected with the contagion under
consideration. We then need to infer the event profile
$\lambda_{ext}(t)$ for all $t$ at which at least one node was
infected, and the parameters $\rho_1$ and $\rho_2$ of the exposure
curve $\eta(x)$. In all, the number of parameters we are inferring is
the number of unique node infection times plus the two parameters of
$\eta(x)$. Our general strategy is to alternate back and forth between
inferring $\lambda_{ext}(t)$ and inferring $\eta(x)$, assuming we know
one for certain while we infer the other, until both functions
converge. Below, we first demonstrate how to infer the event profile
when the exposure curve is known. Then, we show how to infer the
exposure curve with a known event profile. Finally, we combine the two
steps into a single algorithm.

Inferring the event profile. The following outlines a fast and robust
method for inferring $\lambda_{ext}(t)$, given $\eta(x)$. Let $S(t)$
be the number of nodes that are uninfected (by the contagion currently
under consideration) at time $t$. $S(t)$ is a random variable whose
expectation value is dependent on $\lambda_{ext}(t)$, $\eta(x)$, and
the underlying network. The networks which we are interested in are
sufficiently large, so the quantity $S(t) - E[S(t)]$ is usually very
small in magnitude. This provides us with a very straightforward
method for inferring
$\Lambda_{ext}(t) = \int_0^t \lambda_{ext}(s)\, ds$. Let $t_k$ be the
$k^{th}$ time at which at least one node was infected, then define
$\Lambda_k$ as $\Lambda_{ext}(t_k)$. To calculate $S(t_k)$,

$$S(t_k) = \sum_{i=1}^{N} P(\text{node } i \text{ not infected by time } t_k) \qquad (7)$$

$$= \sum_{i=1}^{N} \sum_{n=1}^{\infty} P_{exp}^{(i)}(n; t_k) \prod_{k=1}^{n} \left[ 1 - \eta(k) \right] \qquad (8)$$

$$\approx \sum_{i=1}^{N} \sum_{n=1}^{\infty} P_{exp}^{(i)}(n; t_k) \exp\left( -\int_0^{n} \eta(y)\, dy \right) \qquad (9)$$

$$\approx \sum_{i=1}^{N} \exp\left( -\int_0^{\Lambda_{int}^{(i)}(t_k) + \Lambda_k} \eta(y)\, dy \right) \qquad (10)$$

The first approximation comes from treating the number of exposures
received by a node at any given time as a continuous real number
instead of an integer. This provides us with a closed-form expression.
The second approximation comes from setting the number of exposures
received by each node to be the expected number of exposures.

Since the right-hand side is monotonic (it is strictly decreasing with
respect to $\Lambda_k$), we can solve for $\Lambda_k$ using bisection
search. Doing this for all $t_k$ gives us $\Lambda_{ext}(t_k)$ for
each possible time, and then we can use finite differences to get
$\lambda_{ext}(t_k)$.

Once the event profile has been inferred, we must then update 
the exposure curve accordingly. 
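
The bisection step can be sketched as follows. The code assumes the
exposure curve $\eta(y) = (\rho_1/\rho_2)\, y \exp(1 - y/\rho_2)$, for
which the integral from 0 to $a$ has the closed form
$\rho_1 \rho_2 e \left[ 1 - e^{-a/\rho_2} (1 + a/\rho_2) \right]$. All
function names, the search bound `hi`, and the example inputs are
hypothetical; `internal` is a list holding the expected number of
internal exposures of each node at the time of interest.

```python
import math

def integrated_curve(a, rho1, rho2):
    """Integral of eta(y) from 0 to a for the exposure curve above."""
    return rho1 * rho2 * math.e * (
        1.0 - math.exp(-a / rho2) * (1.0 + a / rho2))

def expected_uninfected(lam_ext, internal, rho1, rho2):
    """Expected number of uninfected nodes when every node has received
    lam_ext external exposures plus its own internal exposures."""
    return sum(math.exp(-integrated_curve(li + lam_ext, rho1, rho2))
               for li in internal)

def solve_lambda_k(observed_uninfected, internal, rho1, rho2,
                   hi=1e3, tol=1e-9):
    """Bisection on the cumulative external exposure; relies on
    expected_uninfected being strictly decreasing in lam_ext."""
    lo = 0.0
    if expected_uninfected(lo, internal, rho1, rho2) <= observed_uninfected:
        return lo       # no external exposure is needed to explain the data
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_uninfected(mid, internal, rho1, rho2) > observed_uninfected:
            lo = mid    # too many survivors predicted: increase exposure
        else:
            hi = mid
    return 0.5 * (lo + hi)

internal = [0.0] * 900 + [1.0] * 80 + [3.0] * 20   # hypothetical network state
target = expected_uninfected(2.5, internal, rho1=0.02, rho2=5)
print(solve_lambda_k(target, internal, rho1=0.02, rho2=5))
```

Repeating this for each infection time yields the cumulative event
profile, and differences between consecutive values give the event
profile itself. If the observed count lies below what any amount of
exposure can explain, the search simply returns a value near `hi`.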

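Inferring the exposure curve. Now, we assume we know
$\Lambda_{ext}(t_k)$ for all $t_k$, and we want to infer the exposure
curve $\eta(x)$, specifically its parameters $\rho_1$ and $\rho_2$.
Our strategy in solving for these parameters will be to fix $\rho_2$,
and then solve for a $\rho_1$ that maximizes the following
approximation to the log-likelihood. Making use of Eq. 6 we have

$$\mathcal{L}(\eta, \lambda_{ext}, \lambda_{int}) = \sum_{i \in \mathcal{I}} \log \left[ \frac{d[F^{(i)}(t)]}{dt} \right]_{t = \tau_i} + \sum_{i \in \mathcal{I}^c} \log \left[ 1 - F^{(i)}(\tau_{max}) \right]$$

$$\approx \sum_{i \in \mathcal{I}} \sum_{n=1}^{\infty} P_{exp}^{(i)}(n; \tau_i) \left[ \log(\eta(n)) + \sum_{k=1}^{n-1} \log(1 - \eta(k)) \right] + \sum_{i \in \mathcal{I}^c} \sum_{n=1}^{\infty} P_{exp}^{(i)}(n; \tau_{max}) \sum_{k=1}^{n} \log(1 - \eta(k))$$

where $\mathcal{I}$ is the set of all infected nodes, $\mathcal{I}^c$
is the set of all uninfected nodes, and $\tau_{max}$ is the time of
the last observed infection.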
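The optimal $\rho_1$ satisfies
$\partial \mathcal{L} / \partial \rho_1 = 0$, so

$$0 = \sum_{i \in \mathcal{I}} \sum_{n=1}^{\infty} P_{exp}^{(i)}(n; \tau_i) \left[ \frac{1}{\rho_1} - \sum_{k=1}^{n-1} \frac{\eta(k)}{\rho_1 \cdot (1 - \eta(k))} \right] \qquad (11)$$

$$- \sum_{i \in \mathcal{I}^c} \sum_{n=1}^{\infty} P_{exp}^{(i)}(n; \tau_{max}) \sum_{k=1}^{n} \frac{\eta(k)}{\rho_1 \cdot (1 - \eta(k))} \qquad (12)$$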



The parameter $\rho_1$ can be solved for iteratively, using an initial
value between 0 and 1. Because the values $P_{exp}^{(i)}$ are
independent of $\rho_1$, they only need to be calculated once. This,
along with the iterations converging quickly, makes this entire
process very fast.

Now, we combine the event profile inference process with the 
exposure curve inference process to form a single algorithm that 
infers the entire model. 

Inferring all parameters. If we use the previously mentioned method to
infer $\eta(x)$ using the actual ground-truth $\Lambda_{ext}(t)$, it
works extremely well. In fact, coming up with contrived instances in
which it breaks is difficult. The same is true for using the event
profile inference method with ground-truth $\eta(x)$. When neither
ground-truth function is known and we have to iterate back and forth
between both methods, however, the results are not as stable. Both
functions' inference methods are sensitive to errors in the other
function. Fortunately, all that is needed to correct this is a slight
modification. Simply put, we fix $\rho_2$ to some integer value and
then iterate back and forth between the two methods. Then, $\rho_1$
and $\Lambda_{ext}(t)$ converge to some values dependent on the fixed
$\rho_2$, and we calculate the log-likelihood of the resulting
inferred functions. We do this for all reasonable integer values of
$\rho_2$, and we choose the one with the optimal log-likelihood.
Algorithm 1 gives the pseudocode.

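Algorithm 1 Model Parameter Inference

  Initialize $\Lambda_{ext}(t)$, $\rho_{final}^{1}$, $\rho_{final}^{2}$
  for $\rho_2 = 1 \to \rho_{max}$ do
    Initialize $\rho_1$
    while not converged do
      $\rho_1$ ← Solution to Eq. 12 using $\rho_2$, $\Lambda_{ext}(t)$
      $\Lambda_{ext}(t)$ ← Solution to Eq. 10 using $\rho_1$, $\rho_2$
    end while
    $\mathcal{L}$ ← Log-Likelihood($\Lambda_{ext}(t)$, $\rho_1$, $\rho_2$)
    if $\mathcal{L} > \mathcal{L}_{max}$ then
      $\mathcal{L}_{max}$ ← $\mathcal{L}$
      $\rho_{final}^{1}$ ← $\rho_1$
      $\rho_{final}^{2}$ ← $\rho_2$
    end if
  end for
  $\Lambda_{ext}(t)$ ← Solution to Eq. 10 using $\rho_{final}^{1}$, $\rho_{final}^{2}$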



Practical considerations. Since we infer the event profile
$\lambda_{ext}(t)$ in non-parametric form, the number of parameters in
the model could potentially scale with the time duration of the
contagion (we would have to solve for $\lambda_{ext}(t_i)$ for each
node's infection time $t_i$). This can be prevented, however, by
predetermining a set of times $\{t_m\}_{m=1}^{M}$ only at which the
event profile will be inferred. Then, $\lambda_{ext}(t)$ between these
set times can be approximated using linear interpolation. In practice,
we used $M = 20$, and we set each $t_m$ at the time at which $m/M$ of
the infections with the contagion have occurred. Doing this not only
makes the runtime constant with respect to the duration of the
contagion, it also speeds up the algorithm in general at the price of
only a negligible decrease in accuracy.
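
A possible implementation of this discretization is sketched below in
Python. The function names are hypothetical, and holding the profile
constant outside the first and last grid time is an assumption of this
sketch that the text does not specify.

```python
import bisect

def profile_times(infection_times, num_points=20):
    """Grid time t_m is the time by which a fraction m / num_points of
    all observed infections has occurred."""
    times = sorted(infection_times)
    n = len(times)
    grid = []
    for m in range(1, num_points + 1):
        rank = (m * n + num_points - 1) // num_points   # ceil(m * n / M)
        grid.append(times[max(rank, 1) - 1])
    return grid

def interpolate_profile(t, grid_times, grid_values):
    """Piecewise-linear event profile between grid times; held constant
    outside the grid (an assumption of this sketch)."""
    if t <= grid_times[0]:
        return grid_values[0]
    if t >= grid_times[-1]:
        return grid_values[-1]
    j = bisect.bisect_right(grid_times, t)   # grid_times[j-1] <= t < grid_times[j]
    t0, t1 = grid_times[j - 1], grid_times[j]
    w = (t - t0) / (t1 - t0)
    return (1.0 - w) * grid_values[j - 1] + w * grid_values[j]

grid = profile_times(range(1, 101), num_points=20)
print(grid)
print(interpolate_profile(7.5, [5, 10], [1.0, 3.0]))
```

Placing the grid at infection quantiles concentrates the inferred
points where the contagion is most active, so bursts are resolved
finely while quiet periods cost few parameters.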

The algorithm scales linearly with the number of nodes that received
at least one exposure. All nodes that received only external exposures
and no internal exposures, however, are effectively identical and can
be grouped into a single term for both the event profile inference and
the exposure curve inference. Therefore, in practice the runtime
scales linearly with the number of nodes that received at least one
internal exposure, i.e., the union of outgoing neighborhoods for all
infected nodes. For most real world social networks, this implies the
runtime scales slightly more than linearly with respect to the number
of infections.

We can infer the model parameters for most contagions well within a
minute. A large portion of real-world contagions in our dataset
infects about 50-100 nodes, and rarely did the algorithm take more
than 10 seconds to converge. For larger contagions, some infecting
thousands of nodes, the runtime was 5-10 minutes.

In all, we used the algorithm to fit the model to more than 18,000
real contagions and hundreds of synthetic contagions, and we never
encountered convergence issues.

Figure 3: Experiments on synthetic data. (a)-(e) The model fitted to a
synthetic contagion on a scale-free network with 75,879 nodes. The
internal hazard function is $\lambda_{int}(t) = t$, which induces a
Rayleigh (unimodal) distribution for the internal exposure propagation
time. Given just the number of infections (a) our model is able to
infer all of (b)-(e). (f)-(j) The model fitted to the same network but
with the internal hazard function $\lambda_{int}(t) = 1/t$, which
induces a power law distribution for the internal exposure propagation
time.

Figure 4: The model fitted to a single contagion representing URLs
related to the Tucson, Arizona shootings. The green vertical lines
designate when four distinct developments related to the shooting
event occurred.

4. EXPERIMENTS 

With our model well-defined and with an algorithm for infer- 
ring its parameters, we now apply it to real as well as synthetic 
data. First, to establish the accuracy of the parameter inference 
algorithm, we fit our model to synthetic data. This allows for di- 
rect comparison of ground-truth to inferred parameters. We exam- 
ine a specific real-world case study to better illustrate the model. 
Lastly, we run a series of large-scale experiments on the emergence 
of Twitter URLs. The model reveals the underlying dynamics of 
information emergence on Twitter. 

4.1 Experiments with synthetic data 

To test the accuracy of the model parameter inference algorithm, we
run a series of experiments on simulated data.

For each experiment, we first generate a large synthetic preferential
attachment network. We then choose values for $\eta(x)$,
$\lambda_{ext}(t)$, and $\lambda_{int}(t)$. At the start of the
experiment, all nodes are uninfected. Then, using a small discrete
time step $\Delta t$, we march forward in time, and external
exposures are sent to each node with probability
$\lambda_{ext}(t) \cdot \Delta t$. If a node becomes infected, it
will transmit exactly one exposure to each of its outbound
neighbors, and the time each outbound exposure takes to propagate is
governed by $\lambda_{int}(t) \cdot \Delta t$. With each exposure a
node receives, we sample a binary random variable with bias
$\eta(x)$ to determine whether the node will become infected upon
that exposure. Once the experiment is complete, the algorithm is
given a set of node infection times, the underlying network, and
$\lambda_{int}(t)$, and its task is to infer $\eta(x)$ and
$\lambda_{ext}(t)$.
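The generative procedure just described can be sketched in a few
lines of Python. This is a minimal illustration under stated
assumptions, not the authors' simulator: the function name
simulate_contagion, the dictionary-based graph representation, and
the toy parameter choices in the usage example are all hypothetical,
and the per-step probabilities are simply capped at 1.

```python
import math
import random

def simulate_contagion(out_neighbors, eta, lam_ext, lam_int, t_max, dt, seed=0):
    """Discrete-time simulation of one contagion; returns {node: infection_time}.

    out_neighbors must contain every node as a key (hypothetical representation).
    """
    rng = random.Random(seed)
    n_exposures = {v: 0 for v in out_neighbors}  # exposures received so far
    infected_at = {}                             # node -> time of infection
    in_flight = []                               # (target, send_time) internal exposures

    def expose(v, t):
        # An exposure only matters for nodes that are still uninfected.
        if v in infected_at:
            return
        n_exposures[v] += 1
        # Bernoulli trial with bias eta(x), x = number of exposures so far.
        if rng.random() < eta(n_exposures[v]):
            infected_at[v] = t
            # A new infection sends exactly one exposure to each out-neighbor.
            in_flight.extend((u, t) for u in out_neighbors[v])

    for k in range(1, int(round(t_max / dt)) + 1):
        t = k * dt
        # Internal exposures: one that has been in flight for time `age`
        # arrives during this step with probability lam_int(age) * dt.
        arrived, waiting = [], []
        for u, t_sent in in_flight:
            if rng.random() < min(1.0, lam_int(t - t_sent) * dt):
                arrived.append(u)
            else:
                waiting.append((u, t_sent))
        in_flight[:] = waiting
        for u in arrived:
            expose(u, t)
        # External exposures: each node is hit with probability lam_ext(t) * dt.
        for v in out_neighbors:
            if rng.random() < min(1.0, lam_ext(t) * dt):
                expose(v, t)
    return infected_at

# Toy usage: a 4-node directed ring with the hazard lam_int(t) = t.
ring = {0: [1], 1: [2], 2: [3], 3: [0]}
times = simulate_contagion(
    ring,
    eta=lambda x: 0.5,                     # hypothetical flat exposure curve
    lam_ext=lambda t: 0.2 * math.exp(-t),  # hypothetical decaying event profile
    lam_int=lambda t: t,
    t_max=10.0,
    dt=0.01,
)
print(sorted(times.items()))
```

The inference task then receives only the returned infection times,
the graph, and lam_int, and must recover eta and lam_ext.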

Baselines. We compared our algorithm against common-sense
baselines. For inferring $\eta(x)$, we used the baseline of assuming
internal exposures propagate immediately, and that all exposures
originate internally. Calculating $\eta(x_k)$ at each exposure count
$x_k$ then boils down to counting the fraction of times a node
becomes infected immediately after $x_k$ of its neighbors become
infected. Note this is exactly the method of inferring $\eta(x)$
used in [24]. The baseline for inferring $\lambda_{ext}(t)$ uses the
number of infections that occur for each unit of time in which none
of the node's neighbors were previously infected. We refer to these
infections as external infections. Since an externally infected
node, by definition, has no infected neighbors, we know with
certainty that all exposures the node received came from the event
profile. Therefore, the arrival of external infections over time
should be indicative of the arrival of external exposures over time,
i.e., the event profile. This, however, only provides a shape (but
not the scale) of the event profile, because without knowledge of
the exposure curve $\eta(x)$, we do not know how many exposures it
typically takes to cause an infection. Thus, the scale of the
baseline $\lambda_{ext}(t)$ is usually 1 to 2 orders of magnitude
larger.
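Both baselines reduce to simple counting, which the following sketch
makes concrete. It is one plausible reading of the description above
rather than the exact procedure of [24]: the function names, the
in_neighbors dictionary (the nodes whose infections expose a given
node), and the unit-width time bins are assumptions made for
illustration.

```python
from collections import Counter

INF = float("inf")

def baseline_exposure_curve(infected_at, in_neighbors):
    """Fraction of nodes infected right after their x-th infected neighbor.

    Assumes instantaneous propagation and purely internal exposures.
    """
    reached = Counter()  # x -> nodes that saw an x-th exposure while uninfected
    adopted = Counter()  # x -> nodes that became infected after exactly x exposures
    for v, nbrs in in_neighbors.items():
        t_v = infected_at.get(v, INF)
        # Neighbors infected strictly before v (all infected ones if v never is).
        k = sum(1 for u in nbrs if infected_at.get(u, INF) < t_v)
        for x in range(1, k + 1):
            reached[x] += 1
        if v in infected_at and k > 0:
            adopted[k] += 1
    return {x: adopted[x] / reached[x] for x in sorted(reached)}

def baseline_event_profile(infected_at, in_neighbors, bin_width=1.0):
    """Count, per time bin, infections of nodes with no previously infected neighbor."""
    counts = Counter()
    for v, t_v in infected_at.items():
        nbrs = in_neighbors.get(v, ())
        if all(infected_at.get(u, INF) >= t_v for u in nbrs):
            counts[int(t_v // bin_width)] += 1
    return dict(counts)

# Toy data (hypothetical): "d" is exposed once by "a" but never infected.
in_neighbors = {"a": [], "b": ["a"], "c": ["a", "b"], "d": ["a"]}
infected_at = {"a": 0.5, "b": 1.2, "c": 3.0}
print(baseline_exposure_curve(infected_at, in_neighbors))
print(baseline_event_profile(infected_at, in_neighbors))
```

As the text notes, the second function yields only the shape of the
event profile, since it counts infections rather than exposures.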

Experimental results. We ran many different combinations of
network topologies, exposure curves, event profiles, and internal
hazard functions. Overall, we ran over 100 different combinations on
networks of 75k+ nodes, and the algorithm not only performed
consistently well but also did significantly better than the
baselines. We included the results of two such experiments in Fig. 3.

For the first experiment, our algorithm is given a network and
the data in Figure 3(a). Based only on this information, it is able
to infer the data shown in Figures 3(b) to 3(e). These figures
illustrate various aspects of the inferred profile of the external
influence, i.e., the event profile, and exposure curve against the
ground truth and the baselines. For the event profile, not only is
the scale of the baseline off by several orders of magnitude, but it
also places the peak of the event profile far too early. On the
other hand, the event profile inferred using our algorithm very
closely predicts the scale, shape, and the occurrence of the profile
peak to the extent that the difference between the ground truth and
inferred event profile is negligible. The situation is the same for
inferring the exposure curve in Fig. 3(e). The inferred $\eta(x)$
almost exactly fits the ground truth, whereas the baseline
overestimates the exposure curve by more than 50%.

For the second experiment (Figs. 3(f)-3(j)), we used a very
peculiar zig-zag ground-truth external influence profile
(Fig. 3(b)), but the observations are still the same: our model was
able to infer all the quantities almost exactly. The event profile
inference shown in Fig. 3(i) is very accurate. It resolves each of
the 10 peaks, while the baseline, besides being orders of magnitude
off in scale, only detects 4 peaks. We infer $\eta(x)$ almost
exactly, as shown in Fig. 3(j).

Note that even though we test the algorithm on synthetic data,
the fact that the model works well is not at all trivial. From the
model fitting point of view, the effects of internal and external
influence are confounded, and the model estimation procedure needs
to separate them out. The contrast in performance between the
baseline approaches and the proposed model illustrates this.
Overall, these experiments demonstrate the robustness of the model
and allow us to move to the experiments on real data.

4.2 Experiments Using Real Data 

We now fit our model to real data from the Twitter network.
We study the emergence of URLs on the Twitter network. URLs
emerge as Twitter users mention them in their tweets (through
tweeting or re-tweeting). Thus, URLs correspond to contagions,
and posting a tweet mentioning a particular URL corresponds to an
infection event.

Twitter dataset. To apply our model to a real-world information
diffusion network, we collected complete Twitter data for January
2011, which consists of 3 billion tweets. We focus on URLs that
have been tweeted by at least 50 users as our contagions of study
(we found that contagions smaller than 50 infections did not
provide robust enough statistics). For URLs that were shortened, we
unshortened them and treated all URLs that point to the same web
address as one contagion. We restricted our focus to URLs that we
could classify as written in English. To do this, we extracted
natural text from the HTML of the URLs and then used a
character-sequence-based classifier to determine their language [7].
We also removed URLs that demonstrated blatant spamming behavior.
In all, this resulted in 18,186 different URLs.

We constructed the network over which these URLs propagate
as follows. First, we took the union of all users that tweeted at
least one of these URLs. Then, for each user in this set we used
the Twitter API to extract a list of the users that they follow. When
one user follows another, he/she can see all of their tweets,
including URLs that they post, and it is through this relationship
that contagions spread on Twitter. In all, this created a
1,087,033 node subgraph with 103,112,438 edges. We focus our study
on URLs as they clearly emerge due to external events.
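The construction can be sketched as follows. The inputs are
hypothetical stand-ins: url_to_users maps each URL to the users who
tweeted it, and following maps a user to the accounts that user
follows (in the study this list came from the Twitter API, which is
not called here). Restricting edges to users inside the set is an
assumption suggested by the word "subgraph".

```python
def build_follow_subgraph(url_to_users, following):
    """Return {user: set of followers in the user set}, i.e. exposure edges.

    An edge followee -> follower means the follower sees the followee's tweets.
    """
    # Union of all users that tweeted at least one of the URLs.
    users = set().union(*url_to_users.values())
    out_neighbors = {u: set() for u in users}
    for follower in users:
        for followee in following.get(follower, ()):
            # Keep only edges between users in the set (assumed restriction).
            if followee in users and followee != follower:
                out_neighbors[followee].add(follower)
    return out_neighbors

# Toy usage with made-up accounts.
url_to_users = {"url1": ["alice", "bob"], "url2": ["bob", "carol"]}
following = {"alice": ["bob", "dave"], "carol": ["alice"]}
graph = build_follow_subgraph(url_to_users, following)
print({u: sorted(fs) for u, fs in sorted(graph.items())})
```

Storing edges in the direction of information flow (from the account
being followed to the follower) matches the way exposures are
transmitted to "outbound neighbors" in the synthetic experiments.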

For the internal hazard function $\lambda_{int}(t)$, empirical
analysis indicates that $\lambda_{int}(t) = $ where $t$ is in hours,
is a suitable choice. This implies that the distribution of lag time
between infections and exposures follows a power law with an
exponent of 1.14.
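The link between a hazard function and the lag-time distribution it
induces follows from a standard survival-analysis identity, restated
here for convenience. The constants $c$ and $t_0$ below are
illustrative symbols only and are not values taken from the fitted
model; a hazard proportional to $1/t$ is simply one standard choice
that yields a power law.

```latex
% S(t): probability that the exposure has not yet arrived by time t.
% f(t): density of the lag time. c and t_0 are illustrative constants.
\begin{align*}
S(t) &= \exp\Big(-\int_{t_0}^{t} \lambda_{int}(s)\,ds\Big),
  \qquad f(t) = \lambda_{int}(t)\,S(t) \\
\lambda_{int}(t) = t,\ t_0 = 0: \quad
  f(t) &= t\,e^{-t^{2}/2} \quad \text{(Rayleigh)} \\
\lambda_{int}(t) = c/t,\ t \ge t_0 > 0: \quad
  f(t) &= \frac{c}{t}\Big(\frac{t_0}{t}\Big)^{c} \propto t^{-(c+1)}
  \quad \text{(power law)}
\end{align*}
```

The first case is the hazard used for the synthetic experiment with
a unimodal lag distribution; the second shows why a hazard that
decays like $1/t$ produces heavy-tailed lag times.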

A case study of the influence of external events. We start our 
investigations on real data with an illustrative case study. Using in- 
formation diffusion, we aim to detect a sequence of external events 
that presumably caused bursts of activity on the Twitter network. 

We examined the Tucson, Arizona shooting on January 8th in
which 6 people were killed and 14 others were injured, and among
the injured was U.S. Congresswoman Gabrielle Giffords. There
were four key developments related to this event: (1) the shooting
occurs (Jan. 8, 10:10am), (2) the Westboro Baptist Church
announces plans to protest at the funerals of the victims (Jan. 9,
9:15am), (3) Arizona Governor Jan Brewer signs emergency
legislation blocking the protest (Jan. 11, 9:24am), and (4) an
on-line "Get Well Soon" card is formed for Gabrielle Giffords that
people can sign (Jan. 12, 6pm).

We collected all URLs that were tweeted at least 50 times that
contained the word "Giffords." We then gathered them into a single
contagion. Given that we aggregated four separate sub-stories, we
would expect that when we fit our model to the observed data, the
event profile would coincide with developments related to the
real-world event. Indeed this is the case, as shown in Fig. 4.

The results of the model applied to the contagion are shown in
Fig. 4. Additionally, the time of each of the 4 developments listed
above is represented as a vertical green line in Fig. 4(a). Our
model clearly detects all four developments: each of them is
followed by a spike in the event profile within 10 hours. For the
second two developments, the spikes in the event profile are
immediate. Also interesting is how the baseline event profile
differs from the model's. For example, immediately after the 3rd
development (i.e., when the governor passed a new law) the model
infers two spikes in $\lambda_{ext}(t)$ whereas the baseline records
only one. In response to the law being passed, many different groups
began organizing counter protests to prevent the Westboro Baptist
Church from interfering with the funerals. This created a second
influx of URLs from sources external to Twitter (Facebook groups,
news sites, etc.), which was completely missed by the baseline.

Evaluation using Google Trends. As an alternative, global
evaluation method, we also performed an experiment in which we
extracted a set of mainstream media articles for which we were able
to identify a single keyword W that adequately describes them (e.g.,
swine flu for a BBC article on "Increase in Northern Ireland swine
flu cases"). For each W, we then queried Google Trends to obtain the
volume of worldwide search traffic for query W over time. This
served as a proxy for the activity of the external source. We
computed the L2 distance between the inferred event profile and
the Google Trends ground truth. Overall, we found that our model
gives a 30% relative improvement in the L2 distance of the inferred
event profile when compared to the naive event profile estimation.
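The comparison can be illustrated with a short sketch. The text does
not say how the two time series were aligned or scaled, so the
normalization to unit sum below is an assumption (search volume and
exposure rates are in different units), as are the function names and
the toy numbers.

```python
import math

def normalize(series):
    """Rescale a non-negative series to sum to 1 (assumed preprocessing)."""
    total = sum(series)
    return [x / total for x in series]

def l2_distance(a, b):
    """Euclidean distance between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relative_improvement(model, baseline, reference):
    """Relative reduction in L2 distance to the reference, model vs. baseline."""
    ref = normalize(reference)
    d_model = l2_distance(normalize(model), ref)
    d_base = l2_distance(normalize(baseline), ref)
    return (d_base - d_model) / d_base

# Toy usage with made-up hourly values.
trends = [1.0, 4.0, 3.0, 2.0]        # proxy for external source activity
model_profile = [0.9, 4.2, 2.8, 2.1]
naive_profile = [3.0, 3.0, 2.0, 2.0]
print(relative_improvement(model_profile, naive_profile, trends))
```

A value of 0.30 from relative_improvement would correspond to the
30% relative improvement reported above.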



External influence of different news categories. We now proceed
to an aggregate analysis of event profiles and external influence of
different categories of news. We identified 9 news sites that specify
the article's category within the URL. All together, we identified
1,929 URLs belonging to 11 different news categories. We then
fit our model to each URL and infer the event profile as well as the
exposure curve. For each news category, we then calculated the
average $\rho_1$, which is the maximum probability of infection for
the exposure curve; $\rho_2$, which is the number of exposures at
which the URL is most infectious; the duration or lifetime over
which the event profile was inferred; and the expected total number
of external exposures each node receives from the URL's event.
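To make the two exposure-curve parameters concrete, the sketch below
uses one functional form whose maximum value is the first parameter
and whose maximum is attained at the second. The paper's own
definition of the exposure curve appears in an earlier section; the
formula here is an assumed stand-in chosen only because it has these
two properties, and the function name is hypothetical.

```python
import math

def exposure_curve(x, rho1, rho2):
    """Assumed form: eta(x) = (rho1 / rho2) * x * exp(1 - x / rho2).

    rho1 is the peak infection probability, rho2 the exposure count
    at which that peak occurs.
    """
    return (rho1 / rho2) * x * math.exp(1.0 - x / rho2)

# Toy usage with illustrative parameter values.
rho1, rho2 = 0.0013, 3.21
for x in range(1, 8):
    print(x, exposure_curve(x, rho1, rho2))
```

Under this form the infection probability rises up to about rho2
exposures and decays afterwards, which is the behavior the category
comparison relies on.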

The results are displayed in Table 2. The average value of $\rho_1$
was 0.0013, $\rho_2$ was 3.21, the average duration of the
contagions was 65.69 hours, and the average fraction of external
infections was 23.94%. In the first column, we show the maximum
probability of infection for the exposure curve. Notice that
Entertainment, Business, and Health appear to be the most
infectious, whereas Art, Education, and Travel are the least
infectious. This seems reasonable as news articles about topics such
as Art or Education would be less likely to be retweeted compared to
Entertainment articles. The second column describes upon which
exposure the URL is most infectious. World News, which is more time
sensitive, reaches maximum infectiousness earlier compared to other
topics. After a user has received more than $\rho_2$ exposures, the
probability of infection decreases, so it makes sense that these
topics, which become irrelevant as time passes, reach this point
sooner. Contrast this with a topic like Art that is naturally less
temporally sensitive. Additionally, we learn that topics with a
smaller $\rho_2$ tend to have shorter duration, and topics with a
larger $\rho_2$ tend to have infections appear over a longer
interval of time. Intuitively this makes sense as topics related to
events (World, Business) get "old" sooner.

Finally, the last column shows on average what percent of
exposures came from external sources versus from within the network.
Politics appears to be the most externally driven topic, while
Entertainment is the most internally driven. This is consistent with
the fact that 22 of the top 30 users followed on Twitter are
entertainers.

Global characteristics. The distributions for both the $\rho_1$ and
$\rho_2$ exposure curve parameters inferred across the entire URL
dataset can be found in Figures 6(a) and 6(b). It is interesting how
low the inferred values of $\rho_1$ were, with a mode on the order
of 0.0005. This implies that people, at least Twitter users, are
very selective about the ideas they adopt. Additionally, most of the
inferred $\rho_2$ parameters were small, with $\rho_2 \approx 1$
being the most common. Recall that a smaller $\rho_2$ implies that
the probability of infection begins to decrease with additional
exposures sooner, and from this we see evidence that users quickly
fatigue of most diffusing contagions.

Next, for each URL, we went through every user that was infected
one by one. For each user, we plotted the order of infection
of the user in relation to all other infections versus the fraction of
expected exposures the user received from internal sources, and the
results can be found in Figure 6(c). This plot demonstrates the
interesting time dynamics at play. On average, the first few users
are infected almost purely externally, but then there is a surge in
internal exposures. As a result, the early infections are largely
internally driven, but as the contagion continues to spread the
infections are driven more and more by external influences. This
initial surge in internally driven infections is also evident in the
aggregated exposure curve, shown in Fig. 5. Upon each infection, the
expected number of exposures the user has received is recorded and
divided by the inferred value of $\rho_2$. This value shows how far
along the node was in the exposure curve when the infection
occurred, and the apex of the exposure curve occurs when it is equal
to 1. As one might expect, there is a high density of infections
occurring at the apex. What is interesting, however, is that there
is also a dense group of infections happening early in the exposure
curve at low probabilities. This group is almost exclusively
populated by internally infected users.

Finally, for each URL we calculated the expected number of ex- 
posures each user received during the emergence of the URL and 
what fraction of these exposures came from an external source. Av- 
eraging across all URLs, we found that 71% of all exposures came 
from internal sources within the network, while the other 29% of 
the exposures were external. We find this 29% to be significant and 
clear evidence that external effects cannot be ignored. 
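The aggregation step amounts to averaging a per-URL fraction. In the
sketch below the per-URL pairs of expected internal and external
exposure counts are taken as given (in the study they come from the
fitted model); the function name and the numbers are hypothetical.

```python
def mean_external_fraction(per_url_exposures):
    """Average, over URLs, of external / (internal + external) exposures.

    per_url_exposures: iterable of (expected_internal, expected_external).
    Each URL gets equal weight, regardless of its size.
    """
    fractions = [
        external / (internal + external)
        for internal, external in per_url_exposures
        if internal + external > 0
    ]
    return sum(fractions) / len(fractions)

# Toy usage with made-up expected exposure counts for three URLs.
toy = [(710.0, 290.0), (50.0, 50.0), (90.0, 10.0)]
print(mean_external_fraction(toy))
```

Averaging per-URL fractions weights every URL equally; pooling all
exposures before dividing would instead weight large contagions more
heavily, and the text does not state which variant was used beyond
"averaging across all URLs".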

5. CONCLUSION 

Emergence of information has traditionally been modeled solely
as a diffusion process in networks. However, we found that only
around 71% of URL mentions on Twitter can be attributed to
network effects, and the remaining 29% of mentions seem to be due to
the influence of external out-of-network sources. We presented
a model in which information can reach a node via the links of the
social network or through the influence of external sources.
Applying the model to the emergence of URLs in the Twitter network
demonstrated that it can be used to infer the shape of influence
functions as well as the effects of external sources on
information diffusion in networks. We emphasize that our model not
only reliably captures the external influence but, as a consequence,
also leads to a more accurate description of the real network
diffusion process.

For future work it would be interesting to relax the assumption
of uniform activity of the external source across all nodes of the
network. Incorporating our model into methods for identifying
"influencers" in networks [18, 5, 9] might be fruitful. Currently,
the phenomena we are observing are clearly taking place in
aggregate. Ultimately, it will be interesting to pursue more
fine-grained analyses as well, understanding how patterns of
variation at the level of individuals contribute to the overall
effects that we observe.